Add self-test recipe for goose validation #5111
Merged
tlongwell-block merged 3 commits into main on Oct 12, 2025
zanesq
approved these changes
Oct 10, 2025
Collaborator
zanesq
left a comment
Nice! Assuming this is cli only right?
Collaborator
Author
Yes, this one is. But @DOsinga and I were talking about using Playwright to test the desktop app. Will try to explore that in a subsequent PR.
Collaborator
Author
cc @angiejones you might think this new feature is fun
Collaborator
can you add it to the checklist @tlongwell-block that we run on a new release?
zanesq added a commit that referenced this pull request on Oct 13, 2025
…sion-streaming
* 'main' of github.com:block/goose: (37 commits)
  Clear deeplinks after use (#5128)
  Revert "Fix gpt-5 input context limit (#4619)" (#5135)
  fix: missing cmake and protobuf for windows build, deduplicate sh/pws… (#5028)
  Fix bedrock tool input schema (#5064)
  Add self-test recipe for goose validation (#5111)
  fix: modifies openai request logic for reasoning models (#4221) (#4294)
  Fix race condition threat when set_param and set_secret of c… (#5109)
  Clean room implementation of the chat process (#5079)
  Bump rmcp (#5096)
  set version in an env variable for testing (#5100)
  fix : enhance fuzzy file search in goose desktop (#5071)
  Make async (#5126)
  docs: unlist tutorials for extensions with archived or moved servers (#5116)
  Add API Documentation Generator prompt (#5001)
  Add flag for enabling eleven labs voice dictation (#5095)
  force re-render fields to pick up custom params usage in instructions (#5112)
  Remove isUserInputDisabled (#5115)
  Improve Rust analysis output for `analyze` tool (#5072)
  Remove duplicate prepare_reply_context call (#5063)
  install react dev tools in development (#4979)
  ...
# Conflicts:
#	ui/desktop/src/components/BaseChat2.tsx
#	ui/desktop/src/hooks/useChatStream.ts
katzdave added a commit that referenced this pull request on Oct 15, 2025
* 'main' of github.com:block/goose: (49 commits)
  fixing video embed (#5171)
  chore: clean up random unused files (#5166)
  fix: adjust download_cli.sh to tolerate no OS variable (#5169)
  mcp tutorial page for firecrawl (#5152)
  Remove orphaned tool calls before compaction (#5059)
  feat: add copy as markdown button to documentation pages (#5158)
  chore: include vendored node executable (#5160)
  remove extra whitespace from message (#5159)
  Clear deeplinks after use (#5128)
  Revert "Fix gpt-5 input context limit (#4619)" (#5135)
  fix: missing cmake and protobuf for windows build, deduplicate sh/pws… (#5028)
  Fix bedrock tool input schema (#5064)
  Add self-test recipe for goose validation (#5111)
  fix: modifies openai request logic for reasoning models (#4221) (#4294)
  Fix race condition threat when set_param and set_secret of c… (#5109)
  Clean room implementation of the chat process (#5079)
  Bump rmcp (#5096)
  set version in an env variable for testing (#5100)
  fix : enhance fuzzy file search in goose desktop (#5071)
  Make async (#5126)
  ...
michaelneale added a commit that referenced this pull request on Oct 16, 2025
* main: (35 commits)
  fix: include apple silicon build of the desktop app in build artifacts (#5174)
  fixing video embed (#5171)
  chore: clean up random unused files (#5166)
  fix: adjust download_cli.sh to tolerate no OS variable (#5169)
  mcp tutorial page for firecrawl (#5152)
  Remove orphaned tool calls before compaction (#5059)
  feat: add copy as markdown button to documentation pages (#5158)
  chore: include vendored node executable (#5160)
  remove extra whitespace from message (#5159)
  Clear deeplinks after use (#5128)
  Revert "Fix gpt-5 input context limit (#4619)" (#5135)
  fix: missing cmake and protobuf for windows build, deduplicate sh/pws… (#5028)
  Fix bedrock tool input schema (#5064)
  Add self-test recipe for goose validation (#5111)
  fix: modifies openai request logic for reasoning models (#4221) (#4294)
  Fix race condition threat when set_param and set_secret of c… (#5109)
  Clean room implementation of the chat process (#5079)
  Bump rmcp (#5096)
  set version in an env variable for testing (#5100)
  fix : enhance fuzzy file search in goose desktop (#5071)
  ...
This PR introduces goose-self-test.yaml, a meta-testing recipe that enables goose to validate its own capabilities through first-person integration testing.
What is First-Person Integration Testing?
Traditional testing approaches rely on external test harnesses, unit tests, or integration suites that examine a system from the outside. This recipe takes a different approach: it has a running goose instance test itself using its own tools and capabilities.
This is meta-testing - the system under test is also the tester, examining its own behavior from within an active session. For an AI agent like goose, this approach offers unique insights into behavioral consistency and tool reliability that external testing cannot provide.
Primary Use Case: Goose Testing Goose
The most powerful application of this recipe is when goose itself is developing new goose features. A goose instance working on the codebase can:
cargo build --release
This creates a recursive development loop where goose can autonomously develop, test, and validate improvements to itself. The goose doing the development can examine test outputs, debug failures, and iterate on fixes - all while using the self-test recipe to validate each iteration.
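One iteration of the loop described above could be sketched as two commands: a release build, then a run of the freshly built binary against the self-test recipe. This is a rough illustration under assumptions, not the PR's actual tooling. The helper name self_test_commands is hypothetical, the binary location target/release/goose and the --recipe / --params flags are assumed from typical cargo and goose CLI usage, and the true/false spelling for parallel_tests is a guess. The test_depth and parallel_tests names and the three depth levels come from the Flexible Execution section of this description.

```python
from pathlib import Path

# Depth levels named in the PR description.
TEST_DEPTHS = ("quick", "standard", "exhaustive")


def self_test_commands(repo_root, recipe="goose-self-test.yaml",
                       test_depth="standard", parallel_tests=True):
    """Return [build_cmd, validate_cmd] for one build-and-validate iteration.

    Hypothetical helper: it only assembles argument lists and runs nothing.
    """
    if test_depth not in TEST_DEPTHS:
        raise ValueError(
            f"test_depth must be one of {TEST_DEPTHS}, got {test_depth!r}"
        )

    # Assumption: the release build leaves a binary named "goose" here.
    goose_bin = Path(repo_root) / "target" / "release" / "goose"

    build_cmd = ["cargo", "build", "--release"]
    validate_cmd = [
        str(goose_bin), "run",
        "--recipe", recipe,  # assumed flag for selecting a recipe file
        # Assumed: one --params KEY=VALUE pair per recipe parameter.
        "--params", f"test_depth={test_depth}",
        "--params", f"parallel_tests={str(parallel_tests).lower()}",
    ]
    return [build_cmd, validate_cmd]
```

Each returned list can be handed to subprocess.run(cmd, cwd=repo_root, check=True), so that a failed build stops the iteration before the self-test starts.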
How It Works
The self-test recipe guides goose through a structured validation process:
The recipe uses goose's own capabilities to create test scenarios, execute them, and validate the outcomes. Each test phase builds on the previous, creating a comprehensive assessment of functionality.
Design Principles
What Can Be Tested
From within a running session, goose can test:
What Cannot Be Tested
Certain aspects require external observation:
The recipe focuses on what's testable from within, providing meaningful validation of user-facing functionality.
Key Features
Flexible Execution
The recipe supports parameterized testing:
test_phases: Select specific test categories or run all
test_depth: Choose between quick, standard, or exhaustive testing
parallel_tests: Enable/disable parallel test execution
workspace_dir: Specify test artifact location
Self-Documenting
The test generates comprehensive reports:
Clean Artifacts
Test artifacts are organized in a single gooseselftest directory, which is automatically added to .gitignore to keep the repository clean.
Why This Matters
Continuous Validation
Provides a standardized method to verify goose functionality across different:
Behavioral Testing
Unlike unit tests that verify code correctness, this tests actual agent behavior - crucial for AI systems where behavior can vary with context and model.
Meta-Cognitive Assessment
The successful completion of self-testing demonstrates goose's ability to:
Quality Assurance
Enables rapid validation after:
Initial Validation
The recipe has been successfully tested with:
Results from initial testing: